scaling law
D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, SlimPajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs, and even then the selected ratio is not guaranteed to be optimal for the specific domain. To address these limitations, inspired by the use of Scaling Laws for performance prediction, we propose to investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes from small-scale training runs on a limited set of experiments. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of the proposed D-CPT Law and Cross-Domain D-CPT Law.
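A minimal sketch of how such a mixture-ratio law can be fit and queried in practice, assuming a Chinchilla-style parametric loss with an added ratio term; the functional form, coefficients, and toy data below are illustrative assumptions, not the paper's actual D-CPT Law.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical parametric form (NOT the paper's fitted law): Chinchilla-style terms
# in model size N and dataset size D, plus an illustrative term in the domain ratio r.
def domain_loss(X, E, A, alpha, B, beta, C, gamma):
    N, D, r = X
    return E + A / N**alpha + B / D**beta + C / (r + 0.05)**gamma

# Toy "small-scale run" observations on a crossed (N, D, r) grid, generated from the
# form above with noise to stand in for measured downstream losses.
rng = np.random.default_rng(0)
Ng, Dg, rg = np.meshgrid([1e8, 4e8, 1e9], [2e9, 8e9, 2e10], [0.1, 0.3, 0.6, 0.9])
N, D, r = Ng.ravel(), Dg.ravel(), rg.ravel()
loss = domain_loss((N, D, r), 1.6, 420.0, 0.33, 1.1e3, 0.28, 0.05, 0.7)
loss += rng.normal(0, 0.005, loss.size)

popt, _ = curve_fit(domain_loss, (N, D, r), loss,
                    p0=[1.5, 300, 0.3, 1e3, 0.3, 0.1, 0.5], maxfev=100_000)

# Predict downstream loss across mixture ratios at a larger target scale, pick the best.
ratios = np.linspace(0.05, 0.95, 19)
pred = domain_loss((np.full_like(ratios, 7e9), np.full_like(ratios, 1e11), ratios), *popt)
print("predicted best mixture ratio:", ratios[np.argmin(pred)])
```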
Observational Scaling Laws and the Predictability of Language Model Performance
Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
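The recipe the abstract describes, reduced to a toy example: build a low-dimensional capability space from existing models' benchmark scores, then fit a smooth (here sigmoidal) curve from that space to a downstream metric. The synthetic data and the plain PCA-plus-sigmoid pipeline below are illustrative simplifications, not the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for ~100 public models scored on 8 standard benchmarks in [0, 1].
hidden = rng.normal(size=(100, 3))                                    # latent capabilities
scores = 1 / (1 + np.exp(-(hidden @ rng.normal(size=(3, 8)) + 0.5)))

# Low-dimensional capability space extracted from the benchmark score matrix.
capability = PCA(n_components=3).fit_transform(scores)

# A harder, "emergent-looking" downstream metric to predict from the leading direction.
target = 1 / (1 + np.exp(-(2.0 * capability[:, 0] - 1.0))) + rng.normal(0, 0.02, 100)

def sigmoid(x, a, b):
    return 1 / (1 + np.exp(-(a * x + b)))

(a, b), _ = curve_fit(sigmoid, capability[:, 0], target, p0=[1.0, 0.0])
print(f"fitted sigmoid: slope={a:.2f}, offset={b:.2f}")
```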
Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
Nimmaturi, Datta, Bhargava, Vaishnavi, Ghosh, Rajat, George, Johnu, Dutta, Debojyoti
Fine-tuning large language models (LLMs) for complex reasoning with reinforcement learning (RL) continues to be prohibitively expensive. Through a phenomenological investigation of GRPO post-training dynamics, we identify a scaling law characterized by exponential reward saturation. The emergence of this early plateau motivates an important question: can GRPO be equipped with principled early-stopping criteria that significantly reduce post-training compute while preserving downstream performance? Across four open-source models (Llama 3B/8B and Qwen 3B/7B), we perform a systematic empirical study of GRPO fine-tuning and derive scaling laws that accurately predict reward trajectories during training. Our analysis shows that GRPO reward curves are well approximated by an exponential saturation with three phases that are consistent across all models: (i) slow initial progress, (ii) rapid improvement, and (iii) saturation. We further show that a simple parametric scaling law, conditioned on model size, initial performance, and normalized training progress, reliably predicts the onset of plateauing performance. A key practical finding is that training beyond roughly 80% of a single epoch yields negligible reward gains while consuming a substantial fraction of total computation. Using our scaling law, practitioners can forecast these phase transitions early and select data-driven stopping points, substantially reducing GRPO compute without sacrificing final performance. Our results suggest that such predictive scaling laws are a promising tool for managing GRPO fine-tuning costs.
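A minimal sketch of the kind of early-stopping forecast the abstract describes: fit a generic exponential-saturation curve to the reward observed early in training, then read off when most of the predicted gain has been realized. The specific functional form, toy trajectory, and 99% threshold below are illustrative assumptions, not the paper's fitted law.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generic exponential-saturation reward curve over normalized training progress t in [0, 1].
def reward_curve(t, r_inf, r0, k):
    return r_inf - (r_inf - r0) * np.exp(-k * t)

# Toy reward trajectory: observe only the first half of one epoch.
t = np.linspace(0.0, 0.5, 25)
obs = reward_curve(t, 0.72, 0.18, 7.0) + np.random.default_rng(1).normal(0, 0.01, t.size)

(r_inf, r0, k), _ = curve_fit(reward_curve, t, obs, p0=[0.7, 0.2, 5.0])

# Propose a stopping point: progress at which 99% of the predicted reward gain is realized.
t_stop = -np.log(0.01) / k
print(f"predicted plateau reward={r_inf:.3f}, suggested stop at t={t_stop:.2f} of one epoch")
```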
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Bergsma, Shane, Dey, Nolan, Gosal, Gurpreet, Gray, Gavia, Soboleva, Daria, Hestness, Joel
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $η$ and weight decay $λ$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size $N$, dataset size $D$, and batch size $B$. Recent work suggests the AdamW timescale, $τ = B/(ηλD)$, should remain constant across training settings, and we verify the implication that optimal $λ$ scales linearly with $B$ for fixed $N$ and $D$. However, as $N$ and $D$ scale, we show that optimal $τ$ obeys a precise power law in the tokens-per-parameter ratio $D/N$. This law thus provides a method to accurately predict $λ_{opt}$ in advance of large-scale training. We also study scaling laws for the optimal batch size $B_{opt}$ (the $B$ enabling the lowest loss at a given $N$, $D$) and the critical batch size $B_{crit}$ (the $B$ beyond which further data parallelism becomes ineffective). In contrast to prior work, we find that both $B_{opt}$ and $B_{crit}$ scale as power laws in $D$, independent of model size $N$. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal $N$ and $D$ under dual training time and compute objectives. All experiments were run on Cerebras CS-3 systems.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
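As a concrete reading of the Power Lines abstract above, a minimal sketch of how the timescale relation $τ = B/(ηλD)$ can be inverted to choose weight decay, assuming optimal $τ$ follows a power law in $D/N$; the coefficient and exponent below are illustrative placeholders, not the paper's fitted values.

```python
# Weight-decay selection from the AdamW timescale tau = B / (eta * lambda * D).
# Assumes tau_opt = c * (D/N)**p; c and p are placeholders, not the paper's fit.
def lambda_opt(B, eta, D, N, c=0.5, p=-0.5):
    tau_opt = c * (D / N) ** p
    return B / (eta * tau_opt * D)

# Example: 1B-parameter model, 20B tokens, 2M-token batches, lr 3e-4 (all illustrative).
print(lambda_opt(B=2e6, eta=3e-4, D=2e10, N=1e9))
```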
Scaling Latent Reasoning via Looped Language Models
Zhu, Rui-Jie, Wang, Zixuan, Hua, Kai, Zhang, Tianyu, Li, Ziniu, Que, Haoran, Wei, Boyi, Wen, Zixin, Yin, Fan, Xing, He, Li, Lu, Shi, Jiajun, Ma, Kaijing, Li, Shanda, Kergan, Taylor, Smith, Andrew, Qu, Xingwei, Hui, Mude, Wu, Bohong, Min, Qiyang, Huang, Hongzhi, Zhou, Xun, Ye, Wei, Liu, Jiaheng, Yang, Jian, Shi, Yunfeng, Lin, Chenghua, Zhao, Enduo, Cai, Tianle, Zhang, Ge, Huang, Wenhao, Bengio, Yoshua, Eshraghian, Jason
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models match the results of SOTA LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge-manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
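For intuition about the looped-computation idea, a toy sketch in which one shared Transformer layer is applied several times and a halting head produces a depth distribution whose entropy can be regularized; this is a heavy simplification under assumed shapes and module choices, not Ouro's actual LoopLM architecture.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Toy looped latent computation: one shared layer reused up to max_loops times,
    with a per-iteration halting logit defining a depth distribution (illustrative only)."""
    def __init__(self, d_model=256, n_heads=4, max_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.halt = nn.Linear(d_model, 1)            # per-iteration halting logit
        self.max_loops = max_loops

    def forward(self, x):
        states, logits = [], []
        h = x
        for _ in range(self.max_loops):
            h = self.block(h)                         # same weights reused each iteration
            states.append(h)
            logits.append(self.halt(h.mean(dim=1)))   # (batch, 1)
        p = torch.softmax(torch.cat(logits, dim=-1), dim=-1)      # depth distribution
        entropy = -(p * p.clamp_min(1e-9).log()).sum(-1).mean()   # entropy-regularizer term
        out = sum(p[:, i, None, None] * s for i, s in enumerate(states))
        return out, entropy

out, ent = LoopedBlock()(torch.randn(2, 16, 256))
print(out.shape, float(ent))
```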
LLM Architecture, Scaling Laws, and Economics: A Quick Summary
The current standard architecture of Large Language Models (LLMs) with QKV self-attention is briefly summarized, including the architecture of a typical Transformer. Scaling laws for compute (FLOPs) and memory (parameters plus data) are given, along with rough (2025) cost estimates for LLMs of various scales, including a discussion of whether DeepSeek should be viewed as a special case. Nothing here is new, but this material seems not to be otherwise readily available in summary form.
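As a rough illustration of the compute-and-cost estimates the summary above refers to, a sketch using the standard ~6·N·D training-FLOPs approximation for dense Transformers; the throughput, utilization, and price numbers are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope training cost from the standard ~6*N*D FLOPs approximation.
# Hardware throughput, utilization, and price below are illustrative placeholders.
def training_cost(n_params, n_tokens, flops_per_gpu_s=4e14, utilization=0.4,
                  usd_per_gpu_h=2.0):
    flops = 6 * n_params * n_tokens
    gpu_hours = flops / (flops_per_gpu_s * utilization) / 3600
    return flops, gpu_hours, gpu_hours * usd_per_gpu_h

flops, gpu_h, usd = training_cost(n_params=70e9, n_tokens=15e12)
print(f"{flops:.2e} FLOPs, {gpu_h:,.0f} GPU-hours, ~${usd:,.0f}")
```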
From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction
Yan, Bencheng, Lei, Yuejie, Zeng, Zhiyuan, Wang, Di, Lin, Kaiyi, Wang, Pengjie, Xu, Jian, Zheng, Bo
Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, in stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields F, not the total vocabulary size n >> F, leading to tighter generalization and, critically, observed power-law scaling of AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity: architectural coherence with data semantics.
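One plausible (and deliberately simplified) reading of a "field-based interaction prior": add a learned F x F bias to attention scores indexed by each position's field id, so the extra parameters scale with the number of fields F rather than the feature vocabulary. The sketch below is a hypothetical illustration of that idea, not the paper's actual FAT design with decomposed content alignment and cross-field modulation.

```python
import torch
import torch.nn as nn

class FieldAwareAttention(nn.Module):
    """Illustrative attention with a learned field-pair bias (hypothetical, not FAT itself)."""
    def __init__(self, d_model=64, num_fields=8):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.field_bias = nn.Parameter(torch.zeros(num_fields, num_fields))
        self.scale = d_model ** -0.5

    def forward(self, x, field_ids):
        # x: (batch, F, d_model), one embedded feature per field; field_ids: (batch, F)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale               # (batch, F, F)
        scores = scores + self.field_bias[field_ids.unsqueeze(-1), field_ids.unsqueeze(-2)]
        return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 8, 64)
field_ids = torch.arange(8).expand(2, 8)
print(FieldAwareAttention()(x, field_ids).shape)    # torch.Size([2, 8, 64])
```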
Appendix A: Scaling Laws
While results presented in the main text show scaling by averaging across cortex, we can also examine scaling on a per-voxel basis. Model-size increases in semantic models seem to be most beneficial for predicting amodal, post-auditory cognitive areas such as prefrontal cortex. [Figure captions: B.1, performance of audio encoding models averaged across all voxels in auditory cortex; B.2, performance of HuBERT models averaged across voxels in cortex; D.1, an example of a long-context artifact effect; E.2, histogram of the slopes of voxelwise scaling laws for two OPT model sizes. Flatmaps presented in the main text use only one subject, S3.]
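A minimal sketch of how per-voxel scaling slopes (as in the E.2 caption) can be computed: regress each voxel's encoding performance on log model size. The model sizes and performance values below are toy placeholders.

```python
import numpy as np

# Toy stand-in: encoding performance (e.g., prediction correlation) for 10,000 voxels
# across four language-model sizes (values are placeholders, not real data).
model_sizes = np.array([125e6, 350e6, 1.3e9, 6.7e9])
perf = np.random.default_rng(2).uniform(0.0, 0.4, size=(10_000, 4))

logN = np.log(model_sizes)
slopes = np.polyfit(logN, perf.T, deg=1)[0]   # one scaling-law slope per voxel
print("mean voxelwise slope:", slopes.mean())
```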
Training Optimal Large Diffusion Language Models
Ni, Jinjie, Liu, Qian, Du, Chao, Dou, Longxu, Yan, Hang, Wang, Zili, Pang, Tianyu, Shieh, Michael Qizhe
We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), encompassing both compute-constrained and data-constrained regimes and studying the key modeling and optimization designs. Quokka is a good friend of Chinchilla and provides a wider scope. We hope these results bring short-term practical guidance for DLM training and long-term inspiration for the whole AI community.
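For orientation, a minimal sketch of the workflow such a law enables in the compute-constrained regime: fit a parametric loss in N and D, then search a compute budget for the best split. The Chinchilla-style form, its coefficients, and the ~6·N·D FLOPs approximation below are illustrative placeholders, not Quokka's fitted law for DLMs.

```python
import numpy as np

# Illustrative Chinchilla-style parametric loss (placeholder coefficients).
E, A, alpha, B, beta = 1.7, 400.0, 0.34, 4e3, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, coeff=6):
    """Grid-search N and D under the budget C ≈ coeff * N * D (coeff ≈ 6 for dense models)."""
    Ns = np.logspace(7, 12, 400)
    Ds = C / (coeff * Ns)
    i = np.argmin(loss(Ns, Ds))
    return Ns[i], Ds[i]

N_star, D_star = compute_optimal(1e21)
print(f"N*≈{N_star:.2e} params, D*≈{D_star:.2e} tokens")
```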